Attention Mechanism from Scratch

Nikolay Oskolkov, SciLifeLab, NBIS Long Term Support, nikolay.oskolkov@scilifelab.se

Abstract

Biological sequences can be viewed as text and therefore analyzed using Natural Language Processing (NLP) techniques. In this notebook, I will compare text generation via Attention mechanism, underlying some Large Language Models (LLM) such as ChatGPT, with more traditional Markov Chain and Long Short Term Memory (LSTM) techniques. I will concentrate on deriving and coding attention mechanism from scratch and explaining its principles.

Table of Contents:

Background and motivation for Computational Biology

Analysis of text data is of great interest for Computational Biology because biological sequences (DNA, RNA, proteins etc.) can be viewed as a text.

In the DNA sequence below, blue segments are gene regions and the red segments are intergenic regions.

A good old way of detecting gene regions is the Hidden Markov Model (HMM), which is based on the concepts of transition and emission probabilities. Briefly, using annotated sets of genic and intergenic sequences and counting how many times A, C, G and T nucleotides are observed in each set, one can compute the emission matrix. Then, one can define a transition matrix via probabilities of changing between genic and intergenic states.

Finally, the Viterbi algorithm, which is the main HMM algorithm, uses a product of the emission and transition matrices in order to infer whether a certain sequence belongs to gene or intergenic region.

Similarly, one can use the concept of a Markov Chain and its transition matrix for generating a DNA sequence. Below we compute the Markov Chain transition matrix, which holds the probabilities of observing each of the four nucleotides, i.e. A, C, G and T, after each of the four nucleotides. Also, please note that the interpretation of the Markov Chain transition matrix is slightly different from that of the HMM transition matrix above.

Now, given an initial nucleotide, we can draw next nucleotides from multinomial (4-classes: A, C, G and T) distribution, and generate a nucleotide sequence of certain length as a Markov Chain.
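The generation step described above can be sketched as follows; the transition matrix values below are made up for demonstration and are not estimated from real human DNA:

```python
import numpy as np

rng = np.random.default_rng(0)
nucleotides = ["A", "C", "G", "T"]

# Hypothetical transition matrix: P[i, j] is the probability that
# nucleotide j directly follows nucleotide i (each row sums to one)
P = np.array([[0.30, 0.20, 0.25, 0.25],
              [0.35, 0.25, 0.05, 0.35],
              [0.25, 0.25, 0.30, 0.20],
              [0.20, 0.25, 0.30, 0.25]])

def generate_sequence(start, length):
    """Generate a nucleotide sequence as a Markov Chain."""
    seq = [start]
    for _ in range(length - 1):
        row = P[nucleotides.index(seq[-1])]
        # Draw the next nucleotide from the 4-class multinomial distribution
        seq.append(rng.choice(nucleotides, p=row))
    return "".join(seq)

print(generate_sequence("A", 50))
```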

The obtained nucleotide sequence should have nucleotide and dinucleotide frequencies similar to the real human DNA.

Finally, before going further, let us ask: if a DNA sequence is a text, what would then be the words? It turns out that k-mers can be considered as words, i.e. building blocks of the DNA sentence / sequence. Indeed, sliding a window of a fixed size (the k-mer size) along a DNA sequence of length L, we can represent it as L - k + 1 words, which we for convenience separate with spaces, see the figure below. Then word / k-mer frequencies can be used for a sentiment-like analysis, i.e. a comparison of word / k-mer usage between case and control DNA, as in the Bag of Words NLP method.
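The sliding-window split into k-mer "words" can be sketched in a few lines; the helper name `kmer_words` is made up for this illustration:

```python
def kmer_words(seq, k):
    """Split a DNA sequence of length L into its L - k + 1 overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Join the k-mers with spaces to obtain a "sentence" of DNA words
sentence = " ".join(kmer_words("ATGCGTAC", 3))
print(sentence)  # ATG TGC GCG CGT GTA TAC
```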

Also, the words / k-mers can be represented as numeric vectors of arbitrary length (so-called embeddings, which can be obtained with a trained Embedding layer of a neural network), as is implemented in the pre-trained Word2Vec NLP model, which can also preserve some semantic relations between the words.

In the tSNE plot above, the words / k-mers are visualized in 2D space, where it is clear that AT-rich and GC-rich k-mers form two distinct clusters. Please keep in mind this possibility to encode words / k-mers as embedding vectors; we will use it later when explaining the concept of Attention.

Text generation with Markov Chain

Now, after we have explained the relevance of textual data analysis for Computational Biology, and demonstrated how a DNA sequence can be generated from a Markov process given a transition matrix, let us exercise a bit with real texts. In this section we will demonstrate how to generate text with a simple short-memory process such as a Markov Chain, which remembers only the previous word in a text. For demonstration purposes, we will be using the text from the first part of the novel "Crime and Punishment" by Fyodor Dostoevsky, which can be copied from the Gutenberg project.

We could use the whole text, but to properly visualize the transition matrix it is helpful to start with just one (not even complete) sentence.

After we have tokenized the sentence (please note that since we are aiming at text generation, in contrast to typical NLP pre-processing we are not going to remove punctuation and stop words or use lemmatization), we can create a transition matrix by counting, for each word, how many times one word follows another. Then, after per-word normalization by the total number of observed word pairs, we get the transition matrix below, where the elements have the interpretation of probabilities of observing two words in a certain order.

Please note that all words but "and" and "he" have only one following word. In contrast, the word "and" has two possible following words, "attendance," and "every", while "he" can be followed by "went" or "was", each with a 50% chance. Now, given the transition matrix, we can specify an initial word (for example, "provided") and randomly draw the next words from the multinomial 21-class distribution, similarly to the way we previously generated the nucleotide sequence.
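The counting, normalization, and sampling steps above can be sketched as follows; the short token list is a made-up stand-in for the tokenized sentence, chosen so that "he" is followed by "went" and "was" with a 50% chance each:

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the tokenized sentence from "Crime and Punishment"
tokens = ("he went and he was provided with garret , diners , "
          "and attendance , and every time").split()

# Count, for each word, how many times one word follows another
counts = defaultdict(lambda: defaultdict(int))
for w1, w2 in zip(tokens[:-1], tokens[1:]):
    counts[w1][w2] += 1

# Per-word normalization turns the counts into transition probabilities
transition = {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
              for w1, nxt in counts.items()}

def generate_text(start, n_words):
    """Draw next words from the per-word multinomial distributions."""
    words = [start]
    for _ in range(n_words - 1):
        nxt = transition.get(words[-1])
        if not nxt:  # dead end: the word never has an observed successor
            break
        choices, probs = zip(*nxt.items())
        words.append(rng.choice(choices, p=probs))
    return " ".join(words)

print(generate_text("he", 10))
```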

We get an almost perfect text, especially in the beginning. However, please note how the meaning has slightly changed starting from the word "and". More specifically, instead of "... with garret, diners, and attendance ...", the sentence flipped to "... with garret, diners, and every time ...". This is because the Markov Chain had a 50% chance to select either "attendance" (a perfect match) or "every" (sub-optimal), and it selected "every". However, providing more text would likely improve the performance, as it would become more and more obvious which word (i.e. "attendance" or "every") is more typical in this context: the counts in the transition matrix grow with the amount of text.

Text generation with LSTM

In this section, we will demonstrate how to generate text with an LSTM neural network, which presumably has a longer memory and can generate more meaningful texts. The design of the LSTM was borrowed from materials available here and here. Again, for simplicity and a more accurate comparison with the Markov Chain text generator, we will use not the whole text of part 1 of "Crime and Punishment" (we could if we wanted!) but only the first three paragraphs, with only a few sentences.

Please note that we are not going to remove punctuation and stop words, or apply stemming and lemmatization, i.e. we are not following the standard NLP text pre-processing steps. This is because our task is text generation, which does not require extensive cleaning of the input text.

Text generation requires sequence input data: given a sequence of words / tokens, the aim is to predict the next word / token.

Next, we need to tokenize our corpus. Tokenization is the process of extracting tokens (terms / words) from a corpus and encoding them as numbers. One can either manually tokenize the corpus or use special tools available in the NLTK and Keras libraries. Keras has a built-in Tokenizer class which can be used to obtain the tokens and their indices in the corpus.

We can see that we have more than one hundred and fifty words in the first three paragraphs of "Crime and Punishment". Now, once the Tokenizer has been applied to the corpus, it can provide multiple outputs such as word counts, a co-occurrence matrix etc. Here, however, we are interested in one particular form of the text, which is a sequence of tokens, i.e. the corpus will be represented as a list of lists (one list per sentence) where each word is converted into an integer.
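A minimal sketch of the tokenization step; the two-sentence corpus below is a short stand-in for the three paragraphs used in the notebook:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Short stand-in corpus; in the notebook this is the first
# three paragraphs of "Crime and Punishment"
corpus = ["he was hopelessly in debt to his landlady",
          "he was afraid of meeting his landlady on the staircase"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

total_words = len(tokenizer.word_index) + 1  # +1 for the reserved 0 index
print(tokenizer.word_index)                  # word -> integer index
print(tokenizer.texts_to_sequences(corpus))  # corpus as lists of integers
```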

In the above output, [11, 23], [11, 23, 34], [11, 23, 34, 35] and so on represent the N-gram phrases generated from the input text data, where every integer corresponds to the index of a particular word in the complete vocabulary of words present in the text. We need to prepare the data in this way, as N-grams, for the particular task of text generation. Later, we will use the last word as a label, and all the words prior to the last word as data. Therefore, the task will be to use all the words prior to the last word in order to predict the last word in a sentence.

After this step, every text document in the dataset is converted into a sequence of tokens. Now that we have generated a dataset which contains sequences of tokens, it is possible that different sequences have different lengths. Therefore, before starting to train the model, we need to pad the sequences to make their lengths equal. We can use the pad_sequences function of Keras for this purpose.

To input this data into a learning model, we need to create predictors and labels. We will use the N-gram sequences as predictors and the next word of each N-gram as the label.
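The N-gram construction, padding, and predictor / label split can be sketched as follows (using the same toy two-sentence corpus as a stand-in for the real paragraphs):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

corpus = ["he was hopelessly in debt to his landlady",
          "he was afraid of meeting his landlady on the staircase"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

# Build N-gram prefixes [w1 w2], [w1 w2 w3], ... for each sentence
sequences = []
for line in corpus:
    tokens = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, len(tokens) + 1):
        sequences.append(tokens[:i])

# Left-pad all sequences to equal length
max_len = max(len(s) for s in sequences)
padded = np.array(pad_sequences(sequences, maxlen=max_len, padding="pre"))

# Predictors are all words before the last; the label is the last word,
# one-hot encoded over the vocabulary
predictors, labels = padded[:, :-1], padded[:, -1]
labels = to_categorical(labels, num_classes=total_words)
print(predictors.shape, labels.shape)
```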

Let us build a Sequential Keras model with the word Embedding layer as the first layer. Then, we apply a bidirectional LSTM layer, where the parameter return_sequences is set to True so that word generation takes into consideration both the previous words and the words coming ahead in the sequence. A dropout layer is added to avoid overfitting; then one more LSTM layer, and one more dense layer with ReLU activation and a regularizer (again to avoid overfitting), are added. The output layer has a softmax activation so as to get the probability of the word to be predicted next.
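A sketch of this architecture in Keras; the vocabulary size, sequence length, and layer widths below are placeholder values, not the ones used in the notebook:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Bidirectional,
                                     LSTM, Dropout, Dense)
from tensorflow.keras.regularizers import l2

total_words = 200   # vocabulary size from the Tokenizer (placeholder)
max_len = 20        # length of the padded predictor sequences (placeholder)

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(total_words, 10),                      # word embedding layer
    Bidirectional(LSTM(64, return_sequences=True)),  # context from both directions
    Dropout(0.2),                                    # regularization against overfitting
    LSTM(64),
    Dense(64, activation="relu", kernel_regularizer=l2(0.01)),
    Dense(total_words, activation="softmax"),        # probability of the next word
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```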

Once the model has been trained, we can provide a seed text, and ask the model to complete the sentence by generating text after the seed.
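A generation loop of this kind can be sketched as below; the function name and greedy argmax decoding are illustrative choices, not necessarily what the notebook uses:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_text(seed_text, n_words, model, tokenizer, max_len):
    """Repeatedly predict the next word and append it to the seed text."""
    text = seed_text
    for _ in range(n_words):
        tokens = tokenizer.texts_to_sequences([text])[0]
        tokens = pad_sequences([tokens], maxlen=max_len, padding="pre")
        probs = model.predict(tokens, verbose=0)[0]
        next_index = int(np.argmax(probs))          # greedy: most probable word
        next_word = tokenizer.index_word.get(next_index, "")
        text += " " + next_word
    return text
```

With a trained model, something like `generate_text("he was", 10, model, tokenizer, max_len)` would then complete the sentence after the seed.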

The sentence generated with the LSTM is not bad, though probably not better than the one generated by the Markov Chain. Let us now modify the LSTM model into a Transformer-like model by including an Attention layer, and check whether the text generation can be improved in this case. First, however, we will introduce and explain the Attention concept in the next section.

Explaining and programming Attention mechanism from scratch

The concept of Attention, as well as the Transformer model, was presented for the first time in 2017 at the NeurIPS conference in the article "Attention Is All You Need"; the preprint with more details is deposited on arXiv.

Attention was claimed to be a special operator (or a layer in a neural network) that could figure out the hub-words of main importance in each sentence. This allowed increasing the memory of the model and accounting for longer context in sequential data such as texts or biological sequences (our case of interest).

Below, we are going to code a simple variant of Attention algorithm from scratch and explain how, and more importantly, why it should work. The summary below was inspired by materials from here, here and here.

Suppose we have a 3D vector representation (3D word embeddings) of 4 words. Recall from the first section that words can be converted to vectors of arbitrary dimensions with the Embedding layer of a neural network, or with a pre-trained Word2Vec model that can also keep some semantic relationships between the words. Below, we will make up 3D vectors, i.e. 3D word embeddings, for 4 words:

Attention is a sequence-to-sequence operation $Y = f(X)$: a sequence of vectors $X$ goes in, and a sequence of vectors $Y$ comes out. To produce the output vectors $Y$, the attention operation simply takes a weighted average over all the input vectors. Here, however, the weights $W$ are not parameters to be learnt but are expressed via the input vectors $X$ themselves. The simplest option for these weights $W$ is the dot product of $X$ and $X^T$, i.e. the covariance of the words. This allows all words to interact with all other words.
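This unparameterized version of attention can be coded in a few NumPy lines; the 3D embeddings for the 4 words are made-up values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

# Made-up 3D embeddings for 4 words (one row per word)
X = np.array([[1.0, 0.5, 0.2],
              [0.9, 0.6, 0.1],
              [0.1, 0.2, 0.9],
              [0.0, 0.3, 1.0]])

# Raw weights: dot product of every word with every other word,
# normalized so each row sums to one
W = softmax(X @ X.T)   # (4 x 4)
Y = W @ X              # weighted average of the inputs, same shape as X
print(Y.shape)  # (4, 3)
```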

Now, we need to somehow introduce the parameters to be learnt during training. For this purpose, we are going to build three (3 x 3) matrices $W_Q$, $W_K$ and $W_V$, which we then multiply by the input vectors $X$, $X^T$ and $X$, respectively. The resulting products are called queries (Q), keys (K) and values (V) in the Attention terminology. Below, I provide the matrix dimensions in the parentheses in order to keep track of the dimensionality of the resulting products. The elements of the $W_Q$, $W_K$ and $W_V$ matrices are the parameters to be learnt during training:

$$\Large Q \, (4 \times 3) = X \, (4 \times 3) * W_Q \, (3 \times 3)$$

$$\Large K \, (3 \times 4) = W_K \, (3 \times 3) * X^T \, (3 \times 4)$$

$$\Large V \, (4 \times 3) = X \, (4 \times 3) * W_V \, (3 \times 3)$$

Below, we will assign random float values to $W_Q$, $W_K$ and $W_V$, and perform their multiplication with the data $X$ in order to obtain the queries, keys and values. Please note that the dimensions of $W_Q$, $W_K$ and $W_V$ are (3 x 3), i.e. they are square matrices of the size of the embedding vectors of the word data $X$.

Now, instead of $X * X^T * X$, after the multiplication by $W_Q$, $W_K$ and $W_V$ (please note that $W_Q$ and $W_V$ multiply $X$ from the right, while $W_K$ multiplies $X^T$ from the left), we get the product of queries, keys and values as $Q * K * V$. The product of queries and keys, $Q * K$, is called the attention weights. It contains information about how much the words are related to each other.

The attention weights $Q * K$ are normalized by the softmax operation and have the meaning of probabilities after this transformation.

Finally, all we need to do is to multiply the attention weights by the transformed $X$ data matrix, which is the matrix of values $V$. The result is a matrix of the same dimension as the initial $X$ data matrix.
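The full sequence of steps, with the same made-up embeddings as before and random (rather than learnt) values in $W_Q$, $W_K$ and $W_V$, can be coded from scratch as:

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Made-up 3D embeddings for 4 words, as above
X = np.array([[1.0, 0.5, 0.2],
              [0.9, 0.6, 0.1],
              [0.1, 0.2, 0.9],
              [0.0, 0.3, 1.0]])

# Random (3 x 3) weight matrices; in a real model their elements are learnt
W_Q, W_K, W_V = (rng.random((3, 3)) for _ in range(3))

Q = X @ W_Q    # queries, (4 x 3)
K = W_K @ X.T  # keys,    (3 x 4)
V = X @ W_V    # values,  (4 x 3)

# Attention weights: softmax-normalized product of queries and keys
attention_weights = softmax(Q @ K)   # (4 x 4), rows are probabilities
Y = attention_weights @ V            # same dimensions as X: (4 x 3)
print(Y.shape)  # (4, 3)
```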

Therefore, after we have applied the Attention operator to the data matrix $X$, we obtain a matrix of the same dimensions, but slightly modified to take into account the relations between all the words. Also, as the Attention theory suggests, more related words have higher weights, which is encoded in the final attention matrix.

Text generation with LSTM with Attention layer

In this section we will demonstrate how to generate text with an LSTM neural network with added Attention layer.

The architecture of this neural network is exactly the same as that of the LSTM model in the previous section; the only difference is that the attention layer model.add(SeqSelfAttention()) has been added, which brings more fitting parameters.

It seems that including Attention did not much improve the LSTM performance.